PowFusion
=================
Applies a linear transform to the input, then raises each element to a power. The exponent can be broadcast as a scalar.

.. math::

    \text{if broadcast: } output_i = (scale \times Input_i + shift)^{exponent_0} \\
    \text{else: } output_i = (scale \times Input_i + shift)^{exponent_i}
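
The formula above can be read as the following scalar reference loop. This is a minimal sketch for checking results on a host machine, not the library's implementation: the name ``pow_fusion_ref`` is hypothetical, and the code relies only on the standard ``powf`` from ``<math.h>``.

.. code-block:: c

    #include <math.h>
    #include <stdbool.h>

    /* Hypothetical host-side reference, not part of the library. */
    void pow_fusion_ref(const float *input, const float *exponent, float *output,
                        int length_in, float scale, float shift, bool broadcast)
    {
        for (int i = 0; i < length_in; ++i) {
            /* With broadcast, exponent[0] is used for every element. */
            float e = broadcast ? exponent[0] : exponent[i];
            output[i] = powf(scale * input[i] + shift, e);
        }
    }

With ``scale = 2``, ``shift = 1``, and exponents ``{2, 3, 0}``, the inputs ``{1, 2, 3}`` give ``{9, 125, 1}`` without broadcast and ``{9, 25, 49}`` with broadcast.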

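Inputs:
    - **Input** - Address of the input data.
    - **exponent** - Address of the exponent data; when **broadcast** is True, ``exponent[0]`` is read and used as a scalar.
    - **length_in** - Length of the input.
    - **scale** - Scale factor of the linear transform.
    - **shift** - Offset of the linear transform.
    - **broadcast** - Whether to broadcast the exponent as a scalar.
    - **core_mask(int, optional)** - Core mask (shared-memory version only).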

Outputs:
    - **output** - Address of the computed result.

Supported platforms:
    ``FT78NE``
    ``MT7004``

.. note::
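    - FT78NE supports fp32, fp64, int8, int16, and int32.
    - MT7004 supports fp32, fp16, int16, and int32.
    - When the exponent is an integer (that is, ``fabs(exp - (int)exp) < 1e-6``), an optimized integer-power implementation is used internally to improve performance.
    - For a negative base with a non-integer exponent, or other invalid values (such as ``0^negative``), the result may be undefined or may produce NaN/Inf; the caller is responsible for any necessary numeric checks and handling.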
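
The integer-exponent fast path mentioned in the note can be sketched as below. The check mirrors the ``fabs(exp - (int)exp) < 1e-6`` condition from the note. The exponentiation-by-squaring routine is an assumption: it is one common way to compute integer powers, and the library's actual optimized routine is not documented here. The names ``is_integer_exponent`` and ``pow_int`` are hypothetical.

.. code-block:: c

    #include <math.h>
    #include <stdbool.h>

    /* Same test as in the note: the exponent counts as an integer when it
       is within 1e-6 of its truncated value. */
    bool is_integer_exponent(double exp)
    {
        return fabs(exp - (int)exp) < 1e-6;
    }

    /* Integer power by repeated squaring: O(log |n|) multiplications. */
    double pow_int(double base, int n)
    {
        /* Widen before negating so that INT_MIN does not overflow. */
        long long k = n;
        if (k < 0) k = -k;

        double result = 1.0;
        while (k > 0) {
            if (k & 1) result *= base;   /* multiply in the current bit */
            base *= base;                /* square for the next bit */
            k >>= 1;
        }
        /* A negative exponent yields the reciprocal (Inf for a zero base). */
        return n < 0 ? 1.0 / result : result;
    }

Because ``(int)`` truncates toward zero, a value such as ``2.9999999`` is not treated as an integer by this check, while ``3.0000001`` is.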

**Shared-memory version:**

.. c:function:: void i8_pow_fusion_s(int8_t* Input, int8_t* exponent, int8_t* output, int length_in, int8_t scale, int8_t shift, bool broadcast, int core_mask)
.. c:function:: void i16_pow_fusion_s(int16_t* Input, int16_t* exponent, int16_t* output, int length_in, int scale, int shift, bool broadcast, int core_mask)
.. c:function:: void i32_pow_fusion_s(int32_t* Input, int32_t* exponent, int32_t* output, int length_in, int scale, int shift, bool broadcast, int core_mask)
.. c:function:: void hp_pow_fusion_s(half* Input, half* exponent, half* output, int length_in, float scale, float shift, bool broadcast, int core_mask)
.. c:function:: void fp_pow_fusion_s(float* Input, float* exponent, float* output, int length_in, float scale, float shift, bool broadcast, int core_mask)
.. c:function:: void dp_pow_fusion_s(double* Input, double* exponent, double* output, int length_in, double scale, double shift, bool broadcast, int core_mask)

    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 14

        // FT78NE shared-memory example
        #include <stdio.h>
        #include <stdbool.h>

        int main(int argc, char* argv[]) {
            float *input = (float *)0xA0000000;      // input resides in DDR space
            float *exponent = (float *)0xA0100000;   // exponent resides in DDR space (or holds a scalar)
            float *output = (float *)0xB0000000;
            int length_in = 1024;
            float scale = 1.0f;
            float shift = 0.0f;
            bool broadcast = false;
            int core_mask = 0xff;
            fp_pow_fusion_s(input, exponent, output, length_in, scale, shift, broadcast, core_mask);
            return 0;
        }


**Private-memory version:**

.. c:function:: void i8_pow_fusion_p(int8_t* Input, int8_t* exponent, int8_t* output, int length_in, int8_t scale, int8_t shift, bool broadcast)
.. c:function:: void i16_pow_fusion_p(int16_t* Input, int16_t* exponent, int16_t* output, int length_in, int scale, int shift, bool broadcast)
.. c:function:: void i32_pow_fusion_p(int32_t* Input, int32_t* exponent, int32_t* output, int length_in, int scale, int shift, bool broadcast)
.. c:function:: void hp_pow_fusion_p(half* Input, half* exponent, half* output, int length_in, float scale, float shift, bool broadcast)
.. c:function:: void fp_pow_fusion_p(float* Input, float* exponent, float* output, int length_in, float scale, float shift, bool broadcast)
.. c:function:: void dp_pow_fusion_p(double* Input, double* exponent, double* output, int length_in, double scale, double shift, bool broadcast)


    **C usage example:**

    .. code-block:: c
        :linenos:
        :emphasize-lines: 13

        // MT7004 private-memory example
        #include <stdio.h>
        #include <stdbool.h>

        int main(int argc, char* argv[]) {
            float *input = (float *)0x10000000;      
            float *exponent = (float *)0x10001000;   
            float *output = (float *)0x10002000;
            int length_in = 1024;
            float scale = 1.0f;
            float shift = 0.0f;
            bool broadcast = false;
            fp_pow_fusion_p(input, exponent, output, length_in, scale, shift, broadcast);
            return 0;
        }
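
Because invalid inputs may yield NaN or Inf, and the note leaves numeric checks to the caller, it can be useful to scan the result of a floating-point variant after the call. The following is a minimal sketch using the standard ``isfinite`` macro from ``<math.h>``; the helper name ``count_non_finite`` is hypothetical and not part of the library.

.. code-block:: c

    #include <math.h>

    /* Hypothetical helper: returns how many elements are NaN or +/-Inf. */
    int count_non_finite(const float *data, int length)
    {
        int bad = 0;
        for (int i = 0; i < length; ++i) {
            if (!isfinite(data[i])) ++bad;
        }
        return bad;
    }

A nonzero return value after calling, for example, ``fp_pow_fusion_p`` would indicate that at least one base/exponent pair fell outside the valid domain.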